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PREFACE 

The  work  described  in  this  publication  was  performed  by  the  Jet 
Propulsion  Laboratory,  an  operating  division  of  the  California 
Institute  of  Technology,  under  contract  NAS7-918,  RE182,  A187  with  the 
National  Aeronautics  and  Space  Administration,  for  the  United  States 
Army  Intelligence  Center  and  School. 

This  revised  edition  replaces  the  original  report,  published  on 
June  26,  1987,  and  is  being  re -published  because  the  original  edition 
contains  photocopy  blotches  and  illegible  print. 
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EXECUTIVE  SUMMARY 


This  Technical  Memorandum  was  prepared 
originally  as  part  of  the  Generic  Fix  Report  (FY-86) 
which  was  elmininated  under  the  FY-87  statement  of 
work  (SOW  £2),  undated  (delivered  to  JPL  19  November 
1986)  . 

The  purpose  of  the  Generic  Fix  Report,  of  which 
this  paper  was  to  be  an  appendix,  was  to  collect  all 
the  material  needed  to  understand  Direction  Finding 
and  Fix  Estimation  and  their  mathematical  basis  in 
one  volume  to  support  the  multi-volume  series  of  Fix 
Estimation  Reports. 

This  paper  is  being  published  because  it  was 
compeleted  in  FY-86  with  FY-86  funds  and  was  being 
held  for  integration  into  the  Generic  Fix  Report. 
It  will  be  of  value  to  readers  desiring  to  persue 
the  mathematics  involved  in  the  Fix  Estimation  Reports. 
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California  Institute  of  Technology 


Pasadena,  California 


Fundamentals  of  Linear  Estimation 


0.  Introduction 

This  memorandum  will  treat  a  selection  of  topics  in  linear 
estimation,  beginning  with  a  very  simple  situation  and  progressing 
through  more  complications.  These  topics  are  shown  in  A.  below.  This 
progression  will  provide  the  main  structure  for  the  document.  Topics 
from  B.  and  C.  below  will  be  brought  in  at  points  where  their  need  has 
been  motivated. 

A  more  conventional  academic  approach  to  covering  this  material 
would  be  to  discuss  B.  and  C.  first,  to  establish  the  foundation,  and 
then  treat  A.  This  approach  would  be  easier  to  do  but  would  lack 
motivation  in  the  early  stages.  We  are  trying  the  stated  approach  on 
the  assumption  that  the  sponsor  wants  a  more  motivated  presentation. 

'•  A.  Topics  in  estimation", 


1.  One  dimensional  estimation  with  one  observation, 

2.  Same  with  two  observations. 

3.  With  same  or  different  observational  errors, 

4.  Any  number  of  observations. 

5.  Two  dimensional  estimation,  independent  errors. 

6.  Multidimensional  estimation,  J 

7.  Combining  sets  of  observations  1 

B,  Properties  of  random  distributions’ 

1.  Mean,  first  moment. 

2.  Standard  deviation,  variance,  second  moment, 

3.  Distribution,  frequency  function,  all  moments.  - 

Normal  distribution. 
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Chi  squared  distribution. 
Student's  t  distribution", 
F  distribution. 

Confidence  intervals i 


C _  Least  squares  . 


Statement,  geometric  interpretation.  — 

Gradient  of  the  sum  of  squares, 

Solution  methods.  ~ 

Orthogonal  transformations^  * 

Using  the  covariance  matrix  of  observation  errors. 


The  general  style  of  the  paper  is  tutorial,  however  due  to  the 
amount  of  material  being  covered,  and  to  avoid  reaching  book- length,  it 
will  be  more  of  a  sketch  of  a  tutorial  rather  than  a  true  tutorial  in 
some  places. 


Equations  will  be  numbered  within  sections.  For  example,  if  there 
is  an  Equation  (2)  in  Section  4.1,  it  will  be  referenced  as  Eq.(2)  from 
within  Section  4.1  and  as  Eq.  4.1.  (2)  from  any  other  section. 

1.  One  dimensional  estimation 

1.1.  A  single  observation 

Suppose  one  makes  a  one  dimensional  observation,  such  as  the 
distance  between  two  stakes  at  a  construction  site.  Suppose  the 
measured  distance  is  95.36  meters,  with  an  uncertainty  of  2  centi¬ 
meters  . 

For  the  moment  we  shall  not  define  this  concept  of  uncertainty  too 
precisely.  One  way  to  think  of  measurement  uncertainty  is  in  terms  of 
how  surprised  we  would  be  at  different  possible  outcomes  if  we  could 
somehow  later  determine  the  distance  much  more  accurately.  We  would  be 
surprised  if  our  error  turned  out  to  be  3  cm. ,  and  very  surprised  if  it 
was  4  cm. ,  and  not  at  all  surprised  if  it  was  just  1  cm.  If  the  error 
turned  out  to  be  greater  than  about  6  cm.  we  would  probably  check  to 
see  if  there  was  a  blunder  in  the  first  measurement  or  if  the  stakes 
had  moved. 

1.2.  Two  observations 

Suppose  we  make  a  second  observation  of  this  same  distance  and 
obtain  95.37  meters,  again  with  an  uncertainty  of  2  cm.  What  is  our 
best  estimate  of  the  true  distance? 

If  we  use  the  principle  of  least  squares,  which  we  will  not 
justify  at  this  point,  we  seek  a  number,  x,  such  that  the  sum  of 
squares  of  residuals  between  x  and  the  observed  values  is  minimized. 
Thus  denoting  the  measurements  by  b:  -  95.36  and  b2  -  95.37,  we  seek  x 
to  minimize 

s  -  (x  -  bj)2  +  (x  -  b2)2 

We  may  differentiate  s  with  respect  to  x,  obtaining 
ds/dx  -  2(x  -  bx  +  x  -  b2) 
which  will  have  the  value  zero  when 

x  -  (bj  +  b2)/2 

i.e.,  when  x  is  the  average  or  mean  of  bj  and  b2.  Thus  our  estimate  of 
the  distance  being  measured  is  95.365  meters. 

What  estimate  of  uncertainty  do  we  attach  to  this  result?  To 
answer  this  we  shall  need  to  adopt  a  mathematical  model  of  uncertainty, 
but  before  doing  this  we  shall  introduce  one  more  example. 


1.3.  Two  observations  with  differing  precision 


Suppose  we  have  the  first  observation  as  before  but  the  make  a 
second  observation  of  this  same  distance  using  a  more  precise  measuring 
device,  obtaining  a  distance  of  95.372  meters,  with  an  uncertainty  of 
0.2  cm.  Now  the  simple  average  of  the  two  measurements  no  longer  seems 
like  a  reasonable  estimate.  We  should  somehow  give  more  weight  to  the 
more  accurate  measurement. 

A  reasonable  way  to  do  this  is  to  scale  the  residuals  for  the  two 
measurements  so  that  equal  values  of  the  scaled  residuals  correspond  to 
equal  levels  of  surprise.  Specifically,  instead  of  the  simple  resid¬ 
uals,  (x  -  bj)  and  (x  -  b2) ,  we  will  use  the  scaled  residuals,  (x  - 
b1)/d1  and  (x  -  b2)/d2,  where  d1  denotes  the  uncertainty  in  the 
measurement  bx .  Thus,  for  example,  if  (x  -  b1)/d1  has  the  value  1.2, 
this  engenders  the  same  level  of  surprise  as  would  be  associated  with 
(x  -  b2)/d2  having  the  value  1.2. 

The  combined  error  function  we  shall  now  seek  to  minimize  is 
s  -  [ (x  -  b1)/d1]2  +  [(x  -  b2)/d2]2 
Differentiating  with  respect  to  x  we  obtain 

ds/dx  -  2 [ (x  -  b1)/d1  +  (x  -  b2)/d2] 
which  will  have  the  value  zero  when 

x  -  (b1/d1  +  b2/d2)  /  (l/d2  +  l/d2) 

Using  the  values,  bj  -  95.36,  dx  -  0.02,  b2  -  95.372,  and  d2  - 
0.002,  we  obtain  the  estimate,  x  -  95.3709.  Note  that  with  this 
estimate  the  simple  residuals  are 

x  -  bx  -  0.0109 

and 

x  -  b2  -  -0.00109 

whereas  the  scaled  residuals  have  equal  magnitudes  of 
I  (x  -  b^/dj  -  |(x  -  b2)/d2|  -  0.545 

Looking  on  to  larger  problems ,  we  remark  that  although  least 
squares  estimation  has  a  tendency  to  balance  the  magnitudes  of  scaled 
residuals,  the  actual  data  and  dimensionality  of  a  problem  limits  how 
closely  this  balance  can  be  approached,  and  in  general  one  can  not  hope 
for  exact  balancing  as  was  attained  in  this  example. 

Now  we  must  develop  a  mathematical  model  for  uncertainties. 

2.  Characterizing  random  distributions 

Consider  again  our  first  example  in  which  we  assumed  the  uncer¬ 
tainty  of  the  measurement  was  2  cm.  Suppose  we  repeat  this  measurement 
1000  times  and  count  the  number  of  times  the  difference  from  our 
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origirji  measurement  falls  in  selected  ranges,  such  as  -4  cm.), 

(-4,  2),  (-2,  0),  (0,  2),  (2,  4),  and  (4,  ®)  .  If  we  repeated  this 
inurement  another  1000  times  we  would  expect  some  stability  in  the 
percentage  of  measurement  differences  falling  in  each  one  of  our  bins. 
For  example  we  would  expect  the  percentage  of  measurement  differences 
falling  in  (0,  2  cm.)  would  hover  around  some  fixed  value,  say  34%. 

A  mathematical  model  that  is  useful  in  analyzing  this  type  of 
behavior  is  the  assumption  that  there  is  a  nonnegative  continuous 
function,  f,  defined  for  all  real  numbers,  and  related  to  our  experi¬ 
ment  by  the  condition  that  the  area  under  the  graph  of  f  between  any 
two  points  a  and  b  gives  the  value  to  which  this  type  of  repeated 
experimenting  and  counting  converges.  Thus,  such  a  function,  f, 
relating  to  our  experiment  would  need  to  have  area  between  0  and  2  of 
0.34. 

The  usual  statistical  terminology  is  to  call  a  function,  f,  a 
frequency  function  or  a  probability  density  function  if  it  is  nonnega¬ 
tive  and  its  integral  from  -®  to  +®  exists  and  has  the  value  1.  We 
will  only  be  concerned  with  frequency  functions  that  are  continuous,  or 
at  most  have  jump  discontinuities  at  a  finite  number  of  points. 


The  indefinite  integral  of  a  frequency  function  is  called  a 
distribution  function.  Thus  from  a  frequency  function,  f,  we  obtain  a 
distribution  function,  F,  defined  by 

F(t)  “  /.l  f(s)  ds 


A  distribution  function  is  defined  for  all  real  numbers,  is 
continuous  and  monotone  nondecreasing.  It  approaches  the  limiting 
value  of  0  as  its  argument  approaches  -<*>,  and  1  as  its  argument  ap¬ 
proaches  +<*>. 

The  term,  random  variable,  ic  commonly  used  to  refer  to  a  quanti¬ 
ty,  such  as  the  measurement  error  in  our  example,  that  typically  has  a 
different  unpredictable  value  each  time  it  is  observed,  but  yet 
exhibits  some  regularity  with  regard  to  the  distribution  of  its  values 
in  a  large  number  of  observations.  Note  that  we  are  not  actually 
giving  a  definition  of  the  term,  random  variable. 

The  closest  we  can  come  to  giving  a  mathematical  definition  of  the 
term,  random  variable,  is  to  say  that  the  statement,  "x  is  a  random 
variable  with  probability  density  function  f"  means  that  certain 
stylized  statements  involving  "x"  are  to  be  taken  as  meaning  something 
specific  about  "f".  As  an  example  of  such  a  statement,  note  that  "the 
probability  that  x  exceeds  2  is  0.02",  which  may  also  be  expressed  as 
"P(x  >  2)  -  0.02",  means  "the  integral  of  f  from  2  to  +«  is  2.’li~x'u 

It  is  often  convenient  to  use  the  term,  distribution .  as  a 
linguistic  aid  in  associating  the  name  of  a  random  variable  with  the 
name  of  its  frequency  function.  For  example  we  may  at  some  point  let  D 
denote  a  random  distribution  with  frequency  function,  f,  and  later  say 
that  x  is  a  sample  from  D. 
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In  practice  one  almost  never  has  enough  information  to  determine  a 
frequency  function  from  empirical  data.  Various  assumptions  are 
typically  made  to  fill  this  gap. 

For  some  purposes  it  suffices  just  to  assume  that  there  is  some 
frequency  function  underlying  the  random  aspects  of  a  problem  and 
compute  estimates  of  certain  attributes  of  the  distribution,  most 
commonly  the  mean,  which  is  a  measure  of  the  central  location  of  the 
distribution,  and  the  standard  deviation,  which  is  a  measure  of  the 
dispersion  of  the  distribution. 

When  one  wishes  to  go  further  and  make  statements  involving 
probabilities,  it  becomes  necessary  to  base  the  analysis  on  some 
specific  frequency  function.  There  are  a  number  of  frequency  functions 
that  have  been  thoroughly  studied  by  statisticians,  so  in  practice  one 
usually  picks  one  of  these  well  known  functions  that  has  a  plausible 
shape  for  the  problem. 

The  mean  of  a  distribution  with  frequency  function,  f,  is  defined 
by 

p  -  s  f  (s)  ds 


while  the  standard  deviation,  a,  is  defined  by 

°2  ~  '  >*) 2  ds 


The  squared  quantity,  a2,  is  called  the  variance  of  the  distri¬ 
bution  . 

It  is  useful  to  have  notations  for  these  concepts  for  use  with  the 
"random  variable"  terminology.  Thus  the  mean  value  of  a  random 
variable,  x,  is  also  called  the  expected  value  of  x,  denoted  by  ECx) . 
This  notation  is  extended  to  apply  to  arbitrary  functions  of  a  random 
variable.  Thus  if  g(x)  is  any  function  of  a  random  variable,  x,  and 
the  following  integral  exists,  we  may  write 


E(g(x) )  -  g(s)  f (s)  ds 

The  expected  value  operator  is  a  linear  operator,  in  the  sense 
that  for  arbitrary  scalars,  a  and  P,  and  functions,  g  and  h,  for  which 
the  required  integrals  exist,  we  have 

E (a  g (x)  +  p  h(x) )  -  a  E(g(x))  +  p  E(h(x)) 


It  is  useful  to  introduce  another  operator,  Var(x) ,  to  capture 
this  last  expression.  Thus  we  define 

Var(x)  -  E(  (x  -  E(x))2  ) 

The  value  of  the  Var  operator  is  insensitive  to  an  additive  shift 
of  the  underlying  distribution  and  varies  with  the  square  of  a  multi¬ 
plicative  factor.  Thus 

Var(ax  +  p)  -  Var(ax)  -  a2  Var(x) 

3.  Examples  of  the  use  of  the  mean  and  standard  deviation 
3.1.  Example  assuming  o  is  known 

Returning  to  the  example  of  Sec.  1.2.,  let  us  model  the  uncer¬ 
tainty  in  the  measurement  process  by  assuming  the  observed  values  bj 
and  b2  are  independent  random  samples  from  some  distribution  with 
frequency  function,  f,  having  mean,  ft,  and  standard  deviation,  a.  We 
assume  that  f  and  ft  are  not  known,  but  we  make  the  rather  strong 
assumption  that  a  is  known  to  have  the  value  2  cm.  Our  goal  is  to 
estimate  ft  and  obtain  an  estimate  of  the  standard  deviation  of  the 
estimated  value. 

The  estimation  function  used  in  Sec.  1.2.  was  the  simple  average 

g(bltb2)  -  (bx  +  b2)/2 

It  will  be  instructive  to  consider  a  slightly  more  general 
estimator  function,  namely 

h(b1,b2)  -  abj,  +  £b2 

and  then  show  that  the  choice  of  o  -  fi  -  1/2  has  certain  desirable 
properties . 

Regarding  bj  and  b2  as  independent  random  samples  from  our  assumed 
distribution,  the  function  h  defines  a  new  random  variable  having  a 
different  distribution.  What  are  the  mean  and  standard  deviation  of 
this  derived  distribution?  What  we  hope,  if  h  is  to  be  of  reasonable 
use  as  an  estimator,  is  that  the  mean  of  h  is  p,  or  has  a  known 
functional  relationship  to  ft,  and  the  standard  deviation  of  h  is  less 
than  a,  so  we  are  estimating  the  quantity  of  interest,  ft,  and  with 
dispersion  less  than  that  of  a  single  observation. 

We  must  generalize  the  definitions  given  previously  for  the 
operators  E()  and  Var(),  because  h  depends  on  two  random  variables. 


We  compute  the  mean  of  h  as 


ECMW)  -  SI  <abl  +  0b2>  f(bl)  f(b2>  dbl  db2 

-  a/  bx  f(bj)  dbj  f  f(b2)  db2 

+  pf  f(b1)  dbj  f  b2  f(b2)  db2 

-  ap  +  Pn  -  (o  +  p)p 
and  the  variance  of  h  as 

Var(h(b:>b2))  -  E{ [h  -  E(h)j2} 

-  E { (ab1  +  pb2  -(a  +  p)p]2) 

-  E{  [a(b1  -  p)  +  p(bz  -  p)]2) 

-  Q2E[(bx-n)2}  +  £2E[(b2-M)2]  +  2a^E[(b1-/i)(b2-/i)] 

-  a2Var(b1)  +  02Var(b2) 

+  2ap$J  (b x-p)(b2-p)  f(bx)  f(b2)  db,  db2 

-  (q2  +  p2 ) a2 

+  2apJ  (br/<)  f(b2)  dbj  f  (b2-M)  f(b2)  db2 

-  (a2  +  P2)a2  +  0  -  (a2  +  p2)a2 

For  the  simple  average  estimator,  g,  where  a  -  P  -  1/2,  these 
formulas  give  a  mean  value  of  p  and  a  standard  deviation  of  o/ 21/z  or 
1 . A  cm . 

What  about  other  values  of  a  and  P?  An  estimator  is  called 
unbiased  if  its  mean  value  is  equal  to  the  quantity  we  wish  to  esti¬ 
mate,  in  this  case,  p.  To  achieve  this  we  see  that  we  must  have 
a  +  P  -  1. 

An  estimator  is  called  minimum  variance  within  its  class  if  no 
other  estimator  in  its  class  has  smaller  variance.  The  minimum  value 
of  the  factor  (a2  +  pz)  ,  subject  to  a  +  P  -  1,  is  attained  when 
a  -  p  -  1/2. 

Any  estimator  of  the  form  abx  +  pb2  is  called  a  linear  estimator. 
From  the  above  we  see  that  such  an  estimator  is  an  unbiased  linear 
estimator  if  o  +  P  -  1,  and  it  is  the  minimum  variance  unbiased  linear 
estimator  if  a  -  p  -  1/2. 

Remark:  The  term,  E[(bj  -  /0(b2  -  p)],  in  the  above  equation  for 
Var  (h(bj  ,b2) )  is  called  the  covariance  of  bj  and  b2,  and  is  denoted  by 
Cov(b1  .b..)  .  This  is  a  very  special  case  of  the  covariance  since  the 
joint  frequency  function  of  bx  and  b2,  here,  fCb^fCbj),  is  the  product 
of  two  functions,  each  depending  on  only  one  of  the  variables.  In  such 
a  case  the  covariance  is  zero  because  it  can  be  written  as  the  product 


of  two  integrals,  each  of  which  is  zero.  We  postpone  discussion  of 
covariance  in  more  general  situations  until  matrix  notation  has  been 
introduced . 

3.2.  Example  assuming  o  is  unknown 

Commonly  one  does  not  know  a  apriori,  as  was  assumed  in  Sec.  3.1., 
but  rather  needs  to  estimate  it  from  the  data.  Such  an  estimate  can  be 
obtained  from  the  sum  of  squares  of  residuals. 

Define  the  residuals  for  the  two  measurements  by 
ri  -  bi  -  g(blfb2)  ,  i  -  1,2 

and  define 

S2  -  r:2  +  r22 

Each  residual  is  a  function  of  the  random  variables,  b:  and  b2, 
and  thus  so  is  S2.  Therefore  S2  is  a  derived  random  variable  having 
its  own  distribution.  The  mean  value  of  S2  may  be  determined  as 
follows : 

E(S2)  -  E(ri2  +  r22} 

-  E{[bx  -  (bj+b2)/2)2  +  [b2  -  (b1+b2)/2 ]2} 

-  1/2  E([bx  -  b2]2} 

-  1/2  ElKbj  -  #i)  *  <b2  -  n))2) 

-  1/2  {Var(b1>  +  Var(b2)  -  2  Cov(b1(b2)} 

-  1/2  {ct2  +  o2  +  0}  -  a2 

Thus  S2  is  an  unbiased  estimator  for  a2.  In  our  example  we  have 
S2  -  (-0.5  cm)2  +  (0.5  cm)2  -  0.5  cm2 
from  which  we  obtain  (0.5) 1/2  -  0.71  cm  as  an  estimate  of  a. 

In  the  more  general  case  of  m  observations  and  n  parameters  being 
estimated  the  expected  value  of  S2  is  (m-n)o2.  Thus  S2/(m-n)  is  an 
unbiased  estimator  for  a2.  In  our  example  we  obtained  E(S2)  -  a2 
because  we  have  m  -  2  and  n  -  1.  The  difference,  (m-n) ,  is  called  the 
number  of  degrees  of  freedom  in  the  problem. 

To  estimate  the  dispersion  of  S2  requires  more  information  or  more 
assumptions.  The  usual  approach  is  to  assume  the  distribution  from 
which  the  data  values,  bt,  arise  is  a  normal  distribution.  Then  the 
scaled  derived  random  variable  S 2/o2  will  have  a  x2  distribution  with 
m-n  degrees  of  freedom.  The  normal  distribution  and  x2  distribution 
will  be  defined  in  Sec.  5. 


Discussion  along  the  above  lines  could  be  carried  out  for  the  case 
of  Sec.  1.3  in  which  the  observations  were  made  with  differing  precis¬ 
ions.  We  will  not  do  this  however  because  it  will  be  much  more  effic¬ 
ient  to  introduce  matrix  notation  and  treat  the  general  problem  of  the 
linear  estimation  of  n  parameters  using  m  observations  subject  to  an 
aprior  covariance  matrix  on  the  errors  of  the  observations. 


4.  The  multidimensional  linear  estimation  problem 
4.1.  Notation  and  concepts  of  linear  algebra 

The  anticipated  applications  of  this  paper  all  involve  real 
numbers  and  thus  we  shall  limit  our  discussion  to  this  context.  One 
should  be  aware,  however,  that  the  concepts  presented  have  identical  or 
very  similar  analogues  in  complex  n-space.  For  further  details  on 
anything  introduced  in  this  section  see  [Golub  and  Van  Loan]  or  [Lawson 
and  Hanson] . 

We  shall  use  Rn  to  denote  n- dimensional  real  space.  A  point  in 
Rn  is  an  n-dimensional  real  vector,  and  will  be  denoted  by  a  lower  case 

roman  or  greek  letter,  e.g.  x,  with  real  components,  xx . x^.  An 

mxn  matrix  is  an  array  of  m  rows  and  n  columns  of  real  numbers,  and 
will  be  denoted  by  an  upper  case  roman  or  greek  letter,  e.g.  B.  The 
transpose  of  a  matrix,  B,  will  be  denoted  by  Bfc. 

A  transformation  between  two  vector  spaces  will  be  called  a  linear 
transformation  if  it  involves  just  a  matrix  multiplication,  and  an 
affine  transformation  if  it  consists  of  a  matrix  multiplication  plus  an 
additive  constant  vector. 

The  number  of  linearly  independent  rows  of  a  matrix,  B,  is  the 
same  as  the  number  of  linearly  independent  columns  and  this  number  is 
called  the  rank  of  B.  If  B  is  nxn  and  of  rank  n  it  is  nonsingular  and 
has  a  unique  inverse  matrix  that  we  denote  by  B"1.  Also,  if  B  is 
nonsingular,  the  matrices  (B1)*1  and  (B'1)1,  exist  and  are  equal  and 
will  be  denoted  by  B'fc. 


The  largest  rank  possible  for  an  nxm  matrix  is  min(m,n).  A  matrix 
having  this  maximal  rank  is  said  to  be  of  full -rank.  A  matrix  whose 
rank  is  less  than  this  maximal  rank  is  called  rank -deficient . 

When  it  is  necessary  to  distinguish  between  a  row  vector  and  a 
column  vector,  we  shall,  for  example,  let  x  denote  a  column  vector  and 
xl  denote  a  row  vector.  For  example,  if  x  and  y  are  n-dimensional 
vectors,  xfcy  denotes  the  scalar  valued  inner  product  and  xyfc  denotes 
the  nxn  matrix  valued  outer  product. 

The  Euclidean  norm  of  a  vector  x  is  denoted  by 

llxll  -  (xhc)1'2 


The  spectral  norm  of  a  matrix  B  is  denoted  by 


||B||  -  Max (  || Bx |!  :  ||x||  -  1  ) 

-  [WB^)]1'2 

where  ^max(BtB)  denotes  the  largest  eigenvalue  of  BtB. 

Two  n-vectors,  x  and  y,  are  mutually  orthogonal  if  x*y  -  0.  A  set 
of  n-vectors,  xk,  xk,  are  mutually  orthogonal  if  each  pair  is 

mutually  orthogonal.  A  set  of  n-vectors,  Xj,  xk,  is  orthonormal 

if  the  set  is  orthogonal  and  each  vector  has  unit  euclidean  norm. 

A  square  matrix  Q  is  called  orthogonal  if  its  transpose  is  also 
its  inverse,  i.e.,  QlQ  -  I,  where  I  denotes  the  identity  matrix.  If  Q 
is  orthogonal  its  row  vectors  constitute  an  orthonormal  set  of  vectors 
and  the  same  is  true  for  its  column  vectors. 


Multiplication  of  a  vector  (respectively  matrix)  by  an  orthogonal 
matrix  preserves  its  euclidean  norm  (respectively  spectral  norm).  Thus 
if  Q  is  an  orthogonal  matrix  then  ||Qx||  -  ||x||  and  ||QB||  -  ||b||  . 


A  2 -dimensional  orthogonal  matrix  is  either  a  rotation  matrix 


Q 

or  a  reflection  matrix 

Q 


cos 
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-sin 

sin 

8 

cos 

cos 

8 

sin 

sin 

8 

cos 

] 

] 


An  n- dimensional  orthogonal  matrix,  with  n  £  2,  can  be  represented  as 
the  product  of  at  most  n(n-l)/2  special  orthogonal  matrices  each  of 
which  represents  either  a  rotation  or  a  reflection  in  the  plane  defined 
by  some  pair  of  coordinate  axes. 


A  square  matrix  A  is  symmetric  if  A1,  -  A.  A  symmetric  matrix  is 
positive  definite  if  xfcAx  >  0  for  every  n-vector  x  /  0  and  nonnegative 
definite  (also  called  positive  semidef inite)  if  xtAx  £  0  for  every  n- 
vector  x  /  0.  Note  that  the  class  of  nonnegative  definite  matrices 
includes  the  class  of  positive  definite  matrices.  A  positive  definite 
matrix  is  nonsingular,  and  its  inverse  matrix  is  positive  definite. 


If  A  is  positive  definite,  the  scalar  quantity,  x^Ay,  may  be 
regarded  as  a  generalized  inner  product  relative  to  the  matrix,  A.  The 
generalized  inner  product  has  analagous  properties  to  the  ordinary 
inner  product,  but  with  some  changes  of  terminology.  For  example, 
whereas  x  and  y  are  mutually  orthogonal  if  x^  -  0,  they  are  mutually 
conjugate  with  respect  to  A  if  x*Ay  -  0. 

Every  matrix  A  of  the  form  A  -  B^  or  A  -  BHJB,  where  B  is  any  mxn 
matrix  and  W  is  nxn  nonnegative  definite,  is  nonnegative  definite.  If 
the  column  vectors  of  B  are  linearly  independent  and  W  is  positive 
definite,  then  A,  given  by  either  of  the  above  two  expressions,  is 
positive  definite. 


A  partial  converse  of  this  is  the  fact  that  any  n-dimensional 
symmetric  nonnegative  definite  matrix,  A,  has  a  Choleskv  factorization 
of  the  form,  A  -  UHJ,  where  U  is  an  nxn  upper  triangular  matrix.  If  A 
is  positive  definite,  then  U  is  uniquely  determined  by  A  to  within  the 
signs  of  the  rows  of  U.  That  is,  if  U  satisfies  A  -  UHJ,  then  so  does 
the  matrix  V  obtained  by  multiplying  any  row  of  U  by  -1.  Thus,  if  A  is 
positive  definite,  we  may  standardize  its  upper  triangular  Cholesky 
factor,  U,  by  requiring  that  its  diagonal  elements  be  positive,  then  U 
is  uniquely  determined  by  A. 

It  is  sometimes  more  convenient  to  focus  attention  on  the  left 
member  of  the  Cholesky  factorization.  Thus  writing  L  for  U*-  we  may 
write  the  factorization  as  A  -  LLl. 

Every  symmetric  matrix  A  has  an  eigensvstem  factorization  of  the 
form,  A  -  VAV*-,  where  A  is  an  nxn  diagonal  matrix,  and  V  is  an  nxn 
orthogonal  matrix.  The  diagonal  elements  of  f^are  the  eigenvalues  of  A 
and  the  column  vectors  of  V  are  the  eigenvectors  of  A.  Note  that  the 
equation  satisfied  by  these  matrices  can  also  be  written  as  AV  —  VA. 

The  n  eigenvalues  of  a  symmetric  matrix  are  uniquely  determined  by 
the  matrix.  The  eigenvalues  of  a  symmetric  matrix  are  all  positive  if 
and  only  if  the  matrix  is  positive  definite  and  are  all  nonnegative  if 
and  only  if  the  matrix  is  nonnegative  definite. 

Every  mxn  matrix,  B,  has  a  singular  value  decomposition,  of  the 
form,  B  -  USV1-,  where  U  is  an  mXm  orthogonal  matrix,  V  is  an  nxn 
orthogonal  matrix,  and  S  is  an  mXn  matrix  that  is  all  zero  except  for 
the  diagonal  terms,  which  may  be  positive  or  zero.  Denoting  the 
diagonal  terms  of  S  by  sit  i  -  1,  ....  min(m,n) ,  it  is  often  useful  to 
assume  these  are  ordered  so  that  Sj  £  s2  2:  ...  The  numbers,  s1,  are 
called  the  singular  values  of  B.  The  number,  say  k,  of  nonzero 
singular  values  is  equal  to  the  rank  of  B.  Since  BtB  -  VStSVt,  it 
follows  that  the  numbers  sj,  i  -  1,  . . . ,  k,  are  the  nonzero  eigenvalues 
of  BtB ,  and  the  column  vectors  of  V  are  the  corresponding  eigenvectors 
of  BtB . 

The  condition  number  of  a  full -rank  matrix  is  the  ratio  between 
its  largest  and  smallest  nonzero  singular  values.  Loosly  speaking  the 
condition  number  of  a  matrix  is  an  upper  bound  on  the  amount  by  which 
relative  errors  in  a  vector  will  be  magnified  when  the  vector  is 
operated  upon  by  the  matrix,  either  by  direct  multiplication  or  by 
solving  a  system.  A  matrix  is  called  well -conditioned  if  its  condition 
number  is  near  one,  and  ill-conditioned  if  its  condition  number  is 
large.  A  matrix  has  the  minimal  possible  condition  number  of  one  if 
and  only  if  either  its  rows  or  columns  (or  both)  are  mutually  ortho¬ 
normal.  Thus  a  square  matrix  has  a  condition  number  of  one  if  and  only 
if  it  is  an  orthogonal  matrix. 

Every  mxn  matrix,  B,  has  a  OR  factorization,  of  the  form,  B  -  QR, 
where  Q  is  an  nxn  orthogonal  matrix  and  R  is  an  upper  triangular  mxn 
matrix.  If  m  >  n  and  Rank(B)  -  n,  the  first  n  column  vectors  of  Q  form 
an  orthogonal  basis  for  the  linear  space  spanned  by  the  column  vectors 
of  B,  and  the  matrix  R  of  the  QR  factorization  of  B  is  also  a  right 
Cholesky  factor  of  the  positive  definite  matrix,  B^,  i.e.,  BfcB  -  R*®. 


The  determinant  of  a  triangular  matrix,  L,  denoted  by  Pet (L1)  .  is 
the  product  of  the  diagonal  elements  of  L.  The  determinant  of  a 
symmetric  nxn  matrix,  A,  is  the  product  of  the  n  eigenvalues  of  A.  The 
determinant  of  the  identity  matrix,  or  of  any  other  orthogonal  matrix, 
is  1.  The  determinant  of  the  product  of  a  set  of  matrices  is  the 
product  of  the  determinants  of  the  matrices.  As  examples,  if  A  is 
positive  definite,  with  the  Cholesky  factorization,  A  -  LL1,  then 

(Det(A) ] 1/2  -  Det(L) 

and  if  B  is  a  square  matrix  with  QR  factorization,  B  -  QR,  then 

Pet(B)  -  Det(R) 

4.2  N-dimensional  random  variables 

An  n-dimensional  frequency  function  or  probability  density 
function  is  a  function  defined  over  Rn  that  is  nonnegative  and  whose 
integral  over  all  of  Rn  exists,  and  has  the  value  1.  Letting  x  denote 
an  n-dimensional  vector,  the  mean  value  of  an  n-dimensional  distribu¬ 
tion  having  frequency  function  f  is  the  n-vector  /i  defined  by 

/i  -  E(x)  -  f  x  f(x)  dx 

where  here  the  integral  sign  denotes  integration  over  all  of  . 

The  nxn  covariance  matrix  of  an  n-dimensional  distribution  with 
frequency  function,  f,  is  defined  by 

H  -  Cov(x)  -  E[(x  -  /i)(x  -/*)*•] 

-  /  (x  -  fi)  (x  -  fi)t  f (x)  dx 

From  the  form  of  this  expression  it  can  be  shown  that  the  matrix  H  is 
symmetric,  i.e.,  h^  -  hJ1  for  all  i  and  j,  and  also  H  is  nonnegative 
definite . 

It  can  be  verified  that 

Cov(x)  -  E(xxt)  - 

If  a  new  random  variable,  u,  is  defined  as  a  linear  transformation 
of  x,  say,  u  -  Ax,  then 

E(u)  -  A  E(x) 

and 

Cov(u)  -  A  Cov(x)  A1 

An  important  special  case  arises  when  the  function  f(x)  is  the 
product  of  n  functions,  each  depending  on  just  one  component  of  x,  i.e. 


In  such  a  case  we  may  assume  without  loss  of  generality  that  the 
functions,  f1(  are  scaled  so  that  each  one  is  a  frequency  function. 
Then  the  off-diagonal  terms  of  H  are  all  zero,  and  each  diagonal  term, 
h11 ,  is  just  the  one -dimensional  variance  of  the  component,  x1 ,  determ¬ 
ined  by  the  frequency  function  fi(x1). 


The  separate  components  of  x  are  said  to  be  independently  dis¬ 
tributed  if  and  only  if  all  of  the  off-diagonal  elements  of  the 
covariance  matrix  are  zero. 


4.3.  Linear  estimation  of  n  parameters  using  m  observations 


Assume  that  an  m- dimens ional  phenomenon  (this  may  be  a  number  of 
instances  of  some  lower  dimensional  phenomena)  to  be  observed  has  a 
distribution,  D,  with  a  frequency  function,  f,  having  an  m-dimensional 
mean  vector,  rj ,  and  an  mxm  positive  definite  covariance  matrix,  H. 


Assume  further  that  rj  is  representable  as  a  linear  combination  of 
n  m-vectors,  bA ,  i  -  1,  ...,n.  Thus  we  are  assuming  there  are  coeffic¬ 
ients,  such  that 


Letting  B  denote  the  mxn  matrix  with  column  vectors  bA,  and  £ 
denote  the  n-vector  with  components  |i,  this  equation  can  be  written  as 


tj  -  BC 


To  avoid  complications  that  would  obscure  the  central  ideas,  we 
assume  that  m  >  n,  and  the  vectors  b1  are  linearly  independent.  It 
follows  that  B  is  of  rank  n. 


Various  linear  estimation  problems  can  be  based  on  this  model, 
depending  on  which  elements  of  the  model  are  assumed  to  be  known  and 
which  are  to  be  estimated.  Consider  first  the  case  in  which  H  and  B 
are  known,  and  f  and  rj  are  unknown.  Suppose  we  have  an  observation,  y, 
regarded  as  a  random  sample  from  the  distribution,  D.  We  wish  to 
estimate  £  and  r?  and  the  covariance  matrices  of  each  of  these  est¬ 
imates  . 


The  Method  of  presentation  used  in  Sections  4.3.1  through  4.3._ 
follows  pp.  36-48  of  [Plackett,  I960]. 


4.3.1.  The  estimator  £  and  its  covariance  matrix 


First  note  that  y  itself  is  a  linear  unbiased  estimator  for  r\ 
since  E(y)  -  rj ,  however  we  shall  see  that  a  better  estimator  is 
available . 


Our  linear  estimator  for  £  will  be 


i  -  (BtH‘1B)'1BtH*1y  -  Py 


< 


Here  we  have  introduced  the  nxm  matrix 

(2)  P  - 

A 

The  estimator  £  is  clearly  the  unique  solution  of  the  linear  system 

(3)  BHi^Bf  -  B^'V 
which  will  be  discussed  further  in  Sec.  4.3.3. 


To  verify  that  |  is  an  unbiased  linear  estimator  for  £  we  compute 
E[£]  -  E(Py]  -  P  E[y]  -  P  ij  -  P  Bf 


-  [  (BtH"1B) ]  B£ 

-  (BtH'1B)-1(BtH'1B)^ 


Note  that  the  property  of  P  that  was  crucial  here  was 

PB  -  I 

Since  B  is  a  rectangular  matrix  it  does  not  have  an  inverse,  but  any 
matrix  G  satisfying  GB  -  I  is  called  a  left  inverse  of  B.  There  will 
in  general  be  many  such  matrices  and  all  provide  unbiased  linear 
estimators  for 

The  covariance  matrix  for  the  estimator,  £,  may  be  computed  as 
Cov(0  -  Cov(Py)  -  P  Cov(y)  Pfc  -  P  H  P*- 

-  [  (BtH”1B)‘1BtH'1]  H  [H*1B(BtH'1B)*1] 

-  (BtH’1B)-1  (BtH*1B)  (BtH'1B)~1 

-  (BtH*1B)*1 

Among  unbiased  linear  estimators,  the  estimator,  Py,  has  the 
minimum  variance,  in  the  sense  that  if  Gy  is  any  other  unbiased  linear 
estimator,  i.e.,  G  satisfies  GB  -  I ,  then  the  difference,  Cov(Gy)  - 
Cov(Py) ,  will  be  a  nonnegative  definite  matrix.  To  verify  this  we 
write  a  matrix  expression  that  is  nonnegative  definite  due  to  its  form 
and  then  show  it  is  equal  to  the  required  difference. 

(G-PWG-P)1  -  GHG1  -  GHPt  -  PHC1  +  PHP1 

-  Cov(Gy)  -  GH[H"1B(BtH*1B)'1J 

-  [  (BtH'1B)'1BtH’1]HGt  +  Cov(Py) 

-  Cov(Gy)  -  Cov(Py)  -  Cov(Py)  +  Cov(Py) 


-  Cov(Gy)  -  Cov(Py) 


4.3.2.  The  estimator  rj  and  its  covariance  matrix 

Since  q  -  B£ ,  we  define  our  estimator  for  r?  to  be 

r,  -  b£  -  BPy 

This  estimator  is  unbiased  since 

E [q]  -  B  E[|]  -  B  i  -  r, 

The  covariance  matrix  for  fj  is 

Cov(rj)  -  Cov(BO  -  B  Cov(£)  B1 

A 

Since  we  obtained  two  different  expressions  for  Cov(£)  we  may  write 
either 

Cov(fj)  -  B  PHP1  B‘  -  BP  H  (BP)1 


or 


Cov (fj)  -  B  (BtH'1B)'1  B1 


Since  P  is  a  left  inverse  for  B,  it  follows  that  BP  is  a  left 
identity  for  B,  i.e.,  (BP)B  -  B.  It  can  be  verified  that  this  Is  the 
crucial  property  that  permitted  verification  that  BPy  is  an  unbiased 
estimator  of  rj .  Thus  if  J  is  any  left  identity  for  B,  i.e.,  J  is  an 
mxm  matrix  satisfying  JB  -  B,  then  Jy  is  an  unbiased  estimator  for  rj . 


Among  all  unbiased  linear  estimators  for  r/ ,  BPy  has  the  minimum 
variance .  in  the  sense  that  if  Jy  is  any  other  unbiased  linear  estimat¬ 
or,  i.e.,  J  satisfies  JB  -  B,  then  the  difference,  Cov(Jy)  -  Cov(BPy) , 
is  a  nonnegative  definite  matrix.  This  is  verified  as  follows: 

(J  -  BP)H(J  -  BP)1  -  J  H  Jfc  -  J  H  PtBt  -  BP  H  J1  +  BP  H  (BP)1 


-  Cov(J) 


-  Cov(J) 


-  Cov(J) 


J  H  [H'1B(BfcH'1B)‘1]  B1 
B  [  (BtH"1B)”1BtH‘1]  H  J1  +  Cov(BP) 
B  (BtH'1B)‘1  B1 
B  (BtH‘1B)'1  B1  +  Cov(BP) 

Cov(BP) 


4.3.3.  Interpretation  as  least  squares  estimation 

The  estimator,  £,  defined  in  Eq.  4.3.1. (1),  can  be  derived  as  the 
solution  to  a  certain  weighted  linear  least  squares  problem.  This  is 
the  problem  of  finding  an  n-vector,  x,  to  minimize  the  quadratic 
function 
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\nr  vw~vtt  v*»  w 


5 

S 


(1)  S2  -  (Bx  -  y)1  H'1  (Bx  -  y) 

where  B  is  a  given  mxn  matrix,  y  is  an  m-vector  of  observations,  and  H 
is  an  mxm  positive  definite  symmetric  matrix  which  is  the  known 
covariance  matrix  for  the  distribution  of  errors  in  y. 

The  n-vector  of  partial  derivatives  of  S2  with  respect  to  x  is 

(2)  3S2/3x  -  2BtH"1(Bx  -  y) 

Setting  this  equal  to  zero  gives  the  system  of  equations 

(3)  BtH'1Bx  -  B’dTV 

to  be  solved  for  x.  We  shall  let  S2  denote  the  minimum  value  of  S2, 

A  A 

i.e.,  the  value  of  when  x  -  £. 

Eq.(3)  (see  also  Eq .  A. 3. 1.(3))  is  called  the  normal  equations  for 
the  least  squares  problem  of  minimizing  S2  of  Eq.(l).  Note  that  when  H 
-  I,  the  geometric  interpretation  of  setting  Eq.(2)  equal  to  zero  is 
that  the  residual  vector,  Bx-y  is  required  to  be  orthogonal  (i.e., 
perpendicular  or  normal)  to  all  columns  of  the  matrix,  B.  I  presume 
this  is  the  reason  the  word  "normal"  has  been  associated  with  Eq.(3). 
When  H  *  I,  Bx-y  is  required  to  be  conjugate  to  all  columns  of  B, 
relative  to  the  positive  definite  matrix,  H"1. 

The  use  of  normal  equations  in  the  form  B^x  -  B1^  dates  back  to 
Gauss,  1821.  The  form  treated  here,  involving  H,  was  introduced  by  A. 
C.  Aitken,  1934.  Reference:  [Plackett] . 

4. 3. 3.1.  Transformations  of  the  least  squares  problem 

There  are  two  types  of  transformations  that  are  very  useful,  both 
for  the  analysis  and  for  the  computational  solution  of  a  least  squares 
problem.  The  first  transforms  the  case  of  general  H  to  the  case  of  H  - 
I,  while  the  second  decomposes  m-space  into  the  n-dimensional  subspace 
spanned  by  the  columns  of  C,  and  the  complimentary  (m-n) -dimensional 
subspace  orthogonal  to  the  columns  of  C. 

For  the  first  transformation  we  factor  H  as 


H  -  LV- 


and  introduce  C  and  z  defined  by 


C  -  L'1B 


(3)  z  -  L_1y 

Then  Eq.  4. 3. 3.(1)  can  be  rewritten  as 


S2  -  (Cx  -  z)t(Cx  -  z)  -  || Cx  -  z| 


mmmmm 


The  statistical  interpretation  of  this  transformation  is  that  y, 
which  was  a  sample  from  a  distribution  having  mean,  rj ,  and  covariance 
matrix,  H,  is  transformed  to  z,  which  is  a  sample  from  a  distribution 
having  mean,  and  covariance  matrix,  I.  Thus  the  components  of  z 
each  have  unit  variance  and  are  mutually  uncorrelated. 

There  are  various  ways  to  achieve  the  factorization,  H  -  LLfc.  The 
simplest  is  the  Cholesky  decomposition  of  H.  Another  approach  would  be 
to  use  the  eigensystem  decomposition,  H  -  VAV1,  and  then  set  L  -  VA1/2. 


The  goal  of  the  second  transformation  is  to  replace  C  by  a  matrix 
having  all  of  its  nonzero  elements  in  its  first  n  rows.  It  is  conven¬ 
ient,  but  not  essential,  to  choose  this  transformation  so  that  it  also 
transforms  z  to  a  vector  having  all  of  its  nonzero  components  in  the 
first  n+1  positions. 


One  way  to  construct  such  a  transformation  is  by  use  of  the  QR 
decomposition  of  C, 

(5)  c-qJ 

where  Q  is  an  mxn  orthogonal  matrix,  R  is  an  nxn  nonsingular  upper 
triangular  matrix,  and  0  denotes  a  zero  matrix  of  conformable  dimen¬ 
sions,  here  an  (m-n)xn  zero  matrix.  (An  alternative  to  the  use  of  the 
QR  decomposition  for  this  transformation  would  be  the  use  of  the 
singular  value  decomposition.  The  SVD  is  computationally  more  expen¬ 
sive,  but  gives  additional  information  that  is  desirable  in  some 
situations.  Ue  shall  not  discuss  the  SVD  further  in  this  paper.) 


Using  Eq . (5)  in  Eq.(A)  gives 
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2  II  2  - 


x  -  Qtz | 


Let  u  denote  the  first  n  components  of  Qlz  and  let  v  denote  the 
last  m-n  components  of  Qtz,  i.e., 


;  p  -  irx 


ull2  +  ||v||2 


This  last  expression  shows  S2  as  the  sum  of  two  terms,  the  first 
of  which  can  be  reduced  to  zero  by  the  unique  x  that  satisfies  Rx  -  u, 
while  the  second  term  is  independent  of  x.  Thus  the  minimum  value  of 
S2  is 


and  since  the  minimizing  value  of  x  is  unique,  and  is  already  known  to 
be  ( ,  it  follows  that  £  satisfies 


w 

s 

, 

From  Cov(z)  -  I,  Eq.(7),  and  the  orthogonality  of  Q,  we  obtain 


(11) 

and  thus,  also 
(12) 
and 
(13) 


Cov( 


I 


Cov(u)  -  I 


Cov(v)  -  1 


From  Eq.(9)  we  obtain 

(14)  E ( § 2 )  -  E(||v||2)  -  2  E(vf)  -  Z  Var(vi)  -  m-n 

4.3.4.  Introducing  a  scaling  factor  for  H 

The  assumption  that  H  is  completely  known  a  priori  is  often 
unrealistic.  A  useful  weakening  of  this  assumption  is  the  assumption 
that  the  covariance  matrix  of  the  distribution,  D,  from  which  the 
observation  was  sampled,  is  <f>2H,  where  H  is  a  known  positive  definite 
matrix  and  <*>  is  an  unknown  scalar.  This  amounts  to  assuming  that  the 
signs  and  relative  sizes  of  the  elements  of  the  covariance  matrix  of  D 
are  known  but  an  overall  scale  factor  is  unknown.  We  shall  see  that 
this  factor,  4>2 ,  can  be  estimated  using  the  sum  of  squares  of  resid¬ 
uals,  S2,  of  Eq.  4.3.3. (1),  generalizing  the  example  discussed  in  Sec. 
3. 


As  to  practical  methods  for  the  a  priori  definition  of  H,  a  common 
situation  would  be  to  assume  the  errors  in  the  different  components  of 
y  are  uncorrelated  so  that  H  need  only  be  a  diagonal  matrix.  Then  each 
diagonal  element  of  H  would  be  assigned  the  a  priori  variance  of  the 
error  in  the  corresponding  component  of  y.  In  this  case  one  would 

expect  the  value  of  4>  to  turn  out  to  be  1,  and  the  extent  to  which  the 

a  posteriori  estimate  of  <f>  differs  from  1  can  be  a  consideration  in 
assessing  the  quality  of  the  model  relative  to  the  available  data. 

Another  possibility  is  that  one  may  simply  assume  that  the  errors 
in  the  different  components  of  y  are  uncorrelated  and  all  have  the 
same,  but  unknown,  variance.  Then  one  could  set  H  -  I  and  the  a 
posteriori  estimated  value  of  <f>2  would  be  an  estimate  of  the  variance 
of  errors  for  each  component  of  y. 

If  we  start  with  ^2H  as  the  covariance  matrix  for  D  and  repeat  the 
derivations  of  Sections  4.3.1,  4.3.2,  4.3.3,  and  4. 3. 3.1,  we  find  ^2 
cancels  out  in  the  expression  for  the  matrix,  P,  so  the  estimator,  f, 

is  unchanged.  The  covariance  matrix  for  (  inherits  the  factor  4>2 ,  so 

we  obtain 

(1)  Cov(0  -  ^z(BtH'1B) 
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Similarly  the  estimator  rj  is  unchanged,  but  its  covariance  becomes 


(2) 

Co v(ij)  - 

<t>2 

BP  H  (BP) 

or 

(3) 

Cov(f))  -  4>2 

B 

(BtH"1B) _1 

Eqs . 4 . 3 . 3 . 1 . ( 11 - 14)  become 

(M 

Cov( 

u 

V 

)  -  I 

(3) 

Cov(u 

)  “  I 

(6) 

Cov(v 

)  ~  I 

(7) 

E(S2) 

- 

(m-n)$2 

4.3 

.4.1.  An  unbiased  estimator  for  4> 

Eq .  4.3.4. (7)  can  be  rewritten  as 

(1)  E[S2/ (m-n) )  -  4>2 

showing  that  S2/(m-n)  is  an  unbiased  estimator  for  4>z .  We  denote  this 
estimator  by 

(2)  V  -  §2/ (m-n) 

This  provides  a  practical  estimate  for  4>  that  is  needed  in  situations 
such  as  were  described  at  the  beginning  of  Sec.  4.3.4. 

The  next  question  one  typically  asks  is  how  good  is  this  estimate. 
In  terms  of  our  mathematical  model  this  translates  to  the  problem  of 
finding  the  variance  of  the  estimator,  #. 

A  direct  derivation  of  this  variance  for  the  case  of  a  general 
underlying  frequency  function,  f,  would  involve  third  and  fourth 
moments  of  f,  i.e.  integrals  of  y3f(y)  and  y*f (y) .  This  would  not  be 
of  practical  use  since  independent  estimates  of  these  moments  would  not 
generally  be  known.  Thus,  as  was  mentioned  in  Sec.  3,  the  usual 
approach  is  to  assume  that  f  is  some  well  known  frequency  function  that 
provides  a  plausible  model  for  the  observation  errors  in  the  real  world 
system  being  investigated,  and  for  which  the  resulting  distribution  for 
52  is  known. 

Standard  statistical  distributions  of  interest  for  our  purposes 
will  be  presented  in  Sec.  5. 


5.  Standard  distributions 


5.1.  The  normal  or  Gaussian  distribution 

The  one-dimensional  normal  distribution,  also  called  the  Gaussian 
distribution .  having  mean,  r) ,  and  variance,  a2 ,  is  conventionally 
denoted  by  N(q,CT2).  Its  frequency  function  is 

f  (y)  -  (2m)'1/2  a'1  exp{  -  [  (y-r?)/o]2/2  } 

If  y  has  the  distribution,  N(r?,o2),  then  the  random  variable 
defined  by  2  -  ( y-r\)/o  has  the  distribution,  N(0,1).  Values  of  the 
frequency  function  and  distribution  function  for  N(0,1)  are  readily 
available  in  tables  and  from  computer  subroutines.  Some  probabilities 
from  the  distribution,  N(0,1),  are  given  in  the  following  table: 

Table  1.  Probabilities  for  z  £  N(0,1) 


P  P( |z|  <  p) 


05 

0 

0.52 

0.4 

0.68 

0.5 

0.84 

0.6 

1.0 

0.683 

1.96 

0.95 

2.0 

0.954 

3.0 

0.997 

-ho 

1.0 

Let  NCm:  v  ,Z~)  denote  the  m- dimensional  normal  distribution  having 
mean  vector,  rj ,  and  covariance  matrix,  Z.  Here  tj  is  an  m- vector  and  Z 
is  a  positive  definite  mxm  matrix.  The  frequency  function  for  this 
distribution  is 

f(y)  -  W"12  [Det(Z)  ]~1/2  exp  [  -  (y-q)tZ"1(y-q)/2  ] 

A  very  significant  property  of  the  normal  distribution  is  the  fact 
that  an  affine  transformation  of  a  normal  random  variable  will  again  be 
a  normal  variable.  Specifically  let  P  be  an  nxm  matrix  with  n  <  m  and 
Rank(P)  -  n.  Let  y  e  N(ra;r?,Z).  Define  a  new  n-dimensional  random  var¬ 
iable,  z  -  f  +  Py.  Then  z  £  N(n;f+Pq ,PZPt) . 

As  an  important  special  case  of  such  an  affine  transformation,  let 
denote  the  lxm  matrix  (i.e.  row  vector)  whose  elements  are  all  zero 
except  for  a  1  in  column  i.  The  product  EAy  is  just  the  component  yA 
of  y,  and  EtZE^  is  just  the  component  <7li  of  Z.  It  follows  that  yi  t 
N(fji,ali).  [The  notation  here  is  a  bit  unfortunate.  Note  that  ali  ia 
the  variance  of  yl  and  thus  the  standard  deviation  of  yi  is  a\ ,2]  . 

An  affine  transformation  that  is  often  useful  is  the  transforma¬ 
tion  from  the  general  case  of  N(m;»7,Z)  to  the  special  case  of  N(m;0,I). 
Denote  the  Cholesky  factorization  of  Z  by  Z  -  LLt.  Then  if  y  has  the 
distribution,  N(m;»j,Z),  the  random  variable  defined  by 


.Nji 

& 
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z  -  L_1(y-»?) 

has  the  distribution,  N(m;0,I). 

As  an  example  of  the  multivariate  normal  distribution,  consider 
N(2;0,I),  which  is  also  called  the  circular  bivariate  normal  distribu¬ 
tion  .  The  integral  of  the  density  function  for  this  distribution  over 
a  disk  of  radius,  p,  has  a  particularly  simple  expression,  namely, 
l-exp( -pz/2) .  Some  probabilities  from  this  distribution  are  given  in 
the  following  table: 

Table  2.  Probabilities  for  z  e  N ( 2 ; 0 , I ) 


na 


p 

p2 

PC  ||z ||2  <  P2)  -  1 

0.32 

0.10 

0.050 

1.00 

1.00 

0.393 

1.18 

1.39 

0.500 

2.00 

4.00 

0.865 

2.45 

5.99 

0.950 

3.00 

9.00 

0.989 

-Ho 

1.000 

5.2.  Conf idence  regions 

A  region  in  which  a  random  sample  is  expected  to  appear  p%  of  the 
time  is  called  a  p%  confidence  region.  For  example,  if  z  e  N(0,1)  the 
interval  [-0.84,  0.84]  is  a  60%  confidence  interval  for  z.  There  are 
infinitely  many  other  60%  confidence  regions,  however,  for  example, 

[-05,  -t~®]  ,  or  [-«>,  05],  or  [-«,  -0 . 52  ]u[0 . 52 ,  +<*>]. 

In  higher  dimensional  spaces  there  is  even  more  variety  in  the 
shapes  and  connectedness  properties  of  regions  that  could  be  chosen  to 
be  a  p%  confidence  region.  For  example  one  could  choose  a  rectangle, 
or  other  polygon,  or  a  circle,  or  nonconvex  or  disjoint  regions.  A 
choice  having  practical  appeal  for  our  purposes  is  the  one  that  can  be 
characterized  as  being  the  region  in  which  the  density  function  exceeds 
a  certain  fixed  value.  Then  the  probability  density  is  larger  at  every 
point  interior  to  the  region  than  it  is  at  any  point  exterior  to  the 
region.  In  the  case  of  the  multivariate  normal  distribution  such 
regions  will  be  ellipsiods.  Recall  that  spheres,  ellipses,  circles, 
and  one  dimensional  intervals  are  all  special  cases  of  ellipsoids. 

As  we  shall  see  in  the  immediately  following  paragraphs,  this 
choice  of  the  definition  of  a  confidence  region  provides  a  significant 
technical  convenience,  in  that  the  case  of  a  general  ellipsoid  reduces 
easily  to  the  case  of  a  sphere,  and  thence  to  dependence  on  a  scalar, 
rather  than  an  n- dimens ional ,  random  variable. 

For  y  £  N(m;*7,2),  we  shall  define  the  p%  confidence  ellipsoid  to 
be  the  unique  ellipsoid,  C,  defined  by 

(1)  C  -  {y  :  <y-»?)tZ-1(y-r?)  <  pz) 


/  ‘,V* '*«/>'  - 

*  ■  M  r  a  *  ■  *  _ 


•  .  -  A  .  ..A  ,  r  -A 


r  V 


where  p  is  the  unique  nonnegative  value  that  permits  the  integral  of 
the  density  function  over  C  to  have  the  value,  p/IOC. 

Note  that  the  frequency  function  for  the  n-dimensional  normal 
distribution  is  the  product  of  [Det(Z)]*1/2  and  a  function  of  (y- 

with  Z  being  positive  definite.  Any  function  of  this  form 
has  the  property  that  for  a  fixed  dimension,  n,  its  integral  over  an 
ellipsoid  C  defined  as  in  Eq.(l)  depends  only  on  p  and  not  on  Z  or  . 
Thus  the  functional  relation  between  p  and  p  given  in  Table  2  applies 
not  only  to  the  circular  bivariate  normal  distribution  but  also  to  the 
general  (elliptical)  bivariate  normal  distribution  with  ||z||2  in  the 
table  heading  replaced  by  (y-rj)tZ'1(y-»j) . 

We  see  that  determining  critical  sizes  for  confidence  ellipses 
defined  as  in  Eq.(l)  depends  on  knowledge  of  the  distribution  of  || z j[ 2 , 
for  Z£N(m;0,i).  This  distribution  is  called  the  x2  distribution  with  m 
degrees  of  freedom,  and  will  be  discussed  in  Sec.  5.3. 

5.2.1.  Geometric  characterization  of  a  confidence  ellipsoid 

Let  C  be  the  ellipsoid  defined  above  in  terms  of  rj ,  Z,  and  p. 

Write  the  eigenvalue  factorization  of  Z  as  Z  -  QAQ*-  where  Q  is  ortho¬ 
gonal  and  A  is  diagonal  with  positive  diagonal  elements  (the  eigen¬ 
values  of  Z) ,  Ait  i  -  1,  ...,  m.  We  shall  assume  the  eigenvalues  are 
ordered  so  that  Xl  >  A2  >  ...  >  Am.  The  column  vectors  of  Q,  which  we 
shall  denote  by  qA ,  i  —  1,  . . . ,m,  are  the  corresponding  eigenvectors  of 
Z. 

Note  that  Z"1  -  QA^Q1,  so  the  eigenvectors  of  Z*1  are  the  same  as 
those  of  Z  while  the  eigenvalues  are  the  reciprocals  of  those  of  Z. 

The  ellipsoid,  C,  is  centered  at  rj  and  its  principal  axes  are 
parallel  to  the  eigenvectors  of  Z  with  qa  giving  the  direction  of  the 
major  axis.  The  semi-diameter,  ai,  in  the  direction  of  the  ith 
principal  axis  can  be  determined  by  solving  the  equation 

(Qiqi)tZ‘1(aiqi)  -  p2 

from  which 

aiAi1  "  P2 

“  P>VZ 

Consider  the  case  in  which  Z  -  o2I.  Then  the  eigenvalues  of  Z  are 
all  equal  to  o2,  so  the  confidence  ellipsoid  is  a  sphere  of  radius 
a  —  pa . 

5.3.  The  x2  distribution 

We  have  encountered  two  different  situations,  Eq.  4. 3. 3. 1.(14)  and 
Sec.  5.2,  in  which  our  analysis  led  to  consideration  of  the  distribu¬ 
tion  of  the  sum  of  squares  of  of  random  variables.  If  these  variables 
are  independent  samples  from  N(0,1),  their  sum  of  squares  has  a 
distribution  called  the  x2  distribution. 


Eq .  4. 3. 3. 1.(9)  showed  that  S2  was  the  sum  of  squares  of  the 
components  of  the  (m-n) -vector ,  v,  and  it  was  also  established  that 
E(v)  -  0  and  Cov(v)  -  ^2I .  If  we  now  add  the  assumption  that  the 
distribution  of  the  original  random  variable,  y,  is  the  normal  distri¬ 
bution,  i.e.,  y  t  N(m \r)  ,4>2'Z) ,  then,  since  v  was  obtained  as  an  affine 
transformation  of  y,  the  distribution  of  v  is  N(n;0,^2I). 

A  random  variable  defined  as  the  sum  of  squares  of  k  independent 
samples  from  N(0,1),  or  equivalently  as  the  sum  of  squares  of  com¬ 
ponents  of  a  sample  vector  from  N(k;0,I),  is  said  to  have  the  x2  ( chi  - 
squared)  distribution  with  k  degrees  of  freedom.  We  will  denote  this 
distribution  by  x2[k] . 

Letting  t  denote  the  independent  variable,  the  density  function 
for  the  x2[k]  distribution  is  zero  for  t  <  0  and 

f-(k/2)-l  -,-t/2 

f(t)  -  -  for  t  £  0 

2 k/2  T(k/2) 


The  distribution  ^[k]  has  a  mean  value  of  k.  The  variance  is  2k 
and  thus  the  standard  deviation  is  (2k)1/z.  Values  of  the  distribution 
function  for  x^fk]  are  available  from  tables  or  computer  subroutines. 

Unlike  the  normal  distribution,  the  x2  distribution  is  unsymmet- 
ric.  The  mode  and  median  of  the  x2[k]  distribution  are  each  less  than 
the  mean.  If  s  is  a  x2[k]  variable,  the  transformed  variable, 
t  -  (2s)1/z,  has  a  distribution  that  is  closely  approximated  by 
N(  (2k-l)1/z,  1)  for  k  >  30. 


Table  3. 

Values 

of  P(s£x) 

for  s  e 

for 

selected 

values  of 

k  and  x 

x  - 

0.05 

0.5 

0.95 

k 

1 

0.004 

0.46 

3.8 

2 

0.10 

1.39 

5.99 

5 

1.1 

4.4 

11.1 

10 

3.9 

9.3 

18.3 

20 

10.9 

19.3 

31.4 

30 

18.5 

29.3 

43.8 

Note  that  the  row  for  k  -  2  In  this  table  agrees  with  values  in 
Table  2  since  |] z ]| 2  of  Table  2  is  a  )^[2]  variable. 

We  may  now  treat  the  question  raised  In  Sec.  4.3.4  regarding  the 
dispersion  of  §2/(m-n)  as  an  estimator  for  4"2 •  If  ve  assume  the  data 
vector  y  is  a  sample  from  N(m;»7,^2H) ,  with  H  known,  then  §2/^2  is  a 
sample  from  x^m-n]  . 


For  example,  if  m-n  -  20,  we  may  conclude  that  there  is  a  90% 
probability  that 

10.9  £  §2/*2  £  31.4 


and  a  5%  probability  that 


S2/<£2  £  31.4 


There  are  various  ways  one  may  use  these  results.  If  one  has  a 
prior  notion  of  a  reasonable  value  for  <f> ,  then  if  52  exceeds  31.4  times 
the  square  of  that  value,  it  may  be  taken  as  evidence  that  the  model  is 
not  consistent  with  the  data,  or  the  prior  value  of  ^  was  incorrect. 

If  one  has  no  prior  notion  of  the  value  of  tf> ,  then  one  might 
conclude  that  there  is  a  90%  probability  that  4>2  satisfies 

§2/31 . 4  <  <f>2  <  S2/10 . 9 

5.4.  Student's  t  distribution 

In  Sec.  4.3.4  the  formula  Cov(£)  -  ^2(BtH"1B)'1  was  obtained. 

Also  it  has  been  noted  that  if  y  has  a  normal  distribution  then  £  does 
also,  in  particular 

(1)  I  £  N(n;£  ,^2(BtH‘1B)'1) 

Thus  if  B,  H,  and  <f>  are  known,  one  can  compute  a  p%  confidence 
ellipsoid  for  £,  or  p%  confidence  intervals  for  individual  components 
of  £,  as  discussed  in  Sec.  5.2. 

If  $  is  not  known  it  can  be  replaced  by  its  estimator, 

(2)  *  -  (SVOn-n)]1'2 

(see  Eq.  4 . 3 . 4 . 2 . (2) )  ,  if  m-n  is  sufficiently  large  so  the  variance  of 
#  is  sufficiently  small.  Alternatively,  particularly  when  m-n  is  not 
large,  a  different  approach  may  be  used,  involving  the  Student's  t 
distribution.  This  distribution  was  presented  in  1908  by  an  anonymous 
author  identified  as  "Student".  Reference:  [Plackett] . 

Eq.(l)  can  be  rewritten  as 

(3)  v*  *  N(n;£/^,  (BtH‘1B)’1) 

which  serves  to  focus  attention  on  \/4>-  We  now  ask:  What  is  the 
distribution  of  £/#? 

Note  that  £  has  a  normal  distribution  and  $2  is  within  a  known 
factor  of  being  a  x2  variable.  Furthermore  these  two  variables  are 
statistically  independent,  since  £  is  a  function  of  u  and  not  v, 

[Eq .4 . 3 . 3 . 1 . (10) ]  ,  S2  is  a  function  of  v  and  not  u,  [Eq .4 . 3 . 3 . 1 . (14) ] , 
and  u  and  v  are  statistically  independent,  [Eq . 4 . 3 . 3 . 1 . (11)  or 

4. 3. 4.  (4)]. 

Thus  we  need  to  know  the  distribution  of  the  ratio  of  a  normal 
variable  to  the  square  root  of  a  statistically  independent  x2  variable. 
That  is  what  the  Student's  t  distribution  is. 


£ 


Let  w€N(n;0,2)  and  /92 e X"2 [ ^  1  .  with  these  distributions  being 
mutually  independent.  Then  the  n-dimensional  vector  random  variable 

(4)  t  -  k1/2w//3 

has  the  n-dimensional  Student's  t  distribution  with  k  degrees  of 
freedom  and  kernel  matrix  2.  The  frequency  function  for  t  is 


f(t)  - 


_ r  [ (k+n)/2 ) _ 

T(k/2)  (k?r)n/2[Det(2)  ) 1/2  [  1  +  t^t] <k+n)/2 


Furthermore , 


E(t)  -  0 


(7)  Cov(t)  -  [k/(k-2)]2 

Remark:  The  frequency  function  for  the  1-dimensional  Student's  t 

distribution  with  2  -  1  can  be  found  in  numerous  sources,  however  I 
wish  to  thank  Dr.  J.  Myhre  for  referring  me  to  [Anderson]  where  Eqs.(5- 
7)  above  are  given  in  Problem  27,  Page  283. 

We  return  now  to  the  study  of  £/$.  To  obtain  w  and  /?  satisfying 
the  conditions  to  define  an  n-dimensional  Student's  t  variable,  let  w  — 
(£-0/4>  and  /9  -  §/<£.  Then  the  variable 

(8)  t  -  (m-n)1/2(|-£)/5  -  (1*0/* 

has  the  Student's  t  distribution  with  (m-n)  degrees  of  freedom  and 
kernel  matrix  2  - 

As  with  the  density  function  for  the  multivariable  normal  dis¬ 
tribution,  the  density  function  for  the  multivariable  Student's  t 
distribution  is  the  product  of  [Det(2)]'1/2  and  a  function  of  ttS"1t, 
with  2  being  positive  definite,  and  thus  has  the  property  that  for  a 
fixed  dimension,  n,  and  number  of  degrees  of  freedom,  k,  its  integral 
over  an  ellipsoid  C  defined  as  in  Eq.  5.2. (1)  depends  only  on  p  and  not 
on  2  or  r) . 

Thus  to  find  critical  values  for  confidence  ellipses  of  the  form 
of  Eq.  5.2.  (1)  for  an  n-dimensional  Student's  t  distribution  with  k 
degrees  of  freedom,  it  suffices  to  know  the  distribution  of  the  scalar 
random  variable,  || 1 1| 2 .  The  variable  ||tj|2/n  has  a  distribution  known  as 
the  F  distribution  with  n  and  k  degrees  of  freedom. 

We  shall  define  the  F  distribution  in  Sec.  5.5,  then  verify  the 
assertion  in  the  last  sentence  above,  and  finally  relate  the  F  distri¬ 
bution  to  the  our  particular  case  of  t  -  (£-£)/£• 


5.5.  The  F  distribution 


If  8 j  £  x^kj]  and  82  e  x2[k2]  then  the  ratio 

^i/kj  k2^i 

^2/^2  ^1^2 

is  said  to  have  the  F  distribution  with  kj  and  k2  degrees  of  freedom. 
This  distribution  will  be  denoted  by  Ffkj.kj], 

The  F  distribution  was  introduced  by  Fisher,  1922,  and  incorpor¬ 
ated  in  a  general  framework  for  testing  linear  hypotheses  by 
Kolodziejczyk,  1935.  Reference:  [Plackett].  Values  of  the  distribu¬ 
tion  function  for  the  F  distribution  are  available  in  tables  and  from 
computer  subroutines. 

Suppose  w  £  N(n;0,I),  82  £  x2!^]  ,  and  t  -  k*/2w /82.  Then  t  is  an 
n- dimensional  Student's  t  variable  with  k:  degrees  of  freedom.  Also 
|] w || 2  is  a  x2[n]  variable.  Let 

l|t||2  k2  ||w||2 

$ - 

n  n  8\ 


Then  rf>  e  F[n,k2]  . 

Continuing  now  our  consideration  of  £/$,  it  was  noted  (See 
Eq.  5.4.  (8))  that  (£*0/$  is  an  n-dimensional  Student's  t  variable  with 
(m-n)  degrees  of  freedom.  It  follows  therefore  that 

IU*$|2/(n^Z)  £  F[n,m-n] .  Thus  critical  points  for  confidence  ellipses 
for  |  when  4>  has  been  estimated  by  #  can  be  obtained  from  tables  for 
the  F  distribution. 


6.  Approaches  to  data  analysis 

6.1.  Analysis  of  one  coherent  set  of  data 

Assume  we  have  a  set  of  m  observed  values,  yA,  i  —  1,  .  .  .  ,  m, 
which  we  will  treat  as  an  m-dimensional  vector.  It  is  assumed  that  y 
can  be  regarded  as  being  a  random  sample  from  the  normal  distribution, 
N(m;»j  ,^2H) ,  where  H  is  a  known  mxm  positive  definite  matrix  and  17  is 
unknown.  We  think  that  the  value  of  ^  is  1  but  we  will  estimate  4>  from 
the  data  as  a  check  on  the  validity  of  the  model. 

It  is  assumed  that  r)  has  a  representation  of  the  form 

(1)  r,  -  B£ 

where  B  is  a  known  mxn  matrix  of  rank  n,  and  (  is  an  unknown  n-vector. 
Our  objectives  are  to  estimate  (,  obtain  a  covariance  matrix  for  the 
estimator  of  (  as  a  measure  of  the  dispersion  of  the  estimator,  and 
estimate  ^  as  a  measure  of  the  quality  of  the  model. 


We  shall  use  the  two  transformations  introduced  in  Sec.  4. 3. 3.1. 
Details  of  the  computational  steps  specified  will  be  found  in  [Lawson 
and  Hanson] . 

Compute  the  Cholesky  factorization  of  H  as  H  —  LLfc.  Then  we  know 
the  minimum  variance  unbiased  linear  estimator,  £,  for  £  is  the  value 
of  x  that  solves  the  least  squares  problem  of 

(2)  minimizing  ||L'1Bx  -  L_1y||2 

Compute  [C:z]  -  L'1[B:y]  by  solving  the  system  L[C:z]  —  [B:y] . 

Then  our  least  squares  problem  is  to 

(3)  minimize  ||Cx  -  z||2 

Compute  the  "QR"  factorization  of  [C:z],  This  gives  an  mxro  ortho¬ 
gonal  matrix,  Q,  and  an  (n+l)x(n+l)  upper  triangular  matrix,  U, 
satisfying 

(4)  [ C :  z ]  -  Q  Jj 
Partition  the  matrix  U  as 

<5>  U  -  [o  a] 

where  R  is  an  nxn  nonsingular  upper  triangular  matrix,  g  is  an  n- 
dimensional  column  vector,  0  denotes  an  n-dimensional  row  vector  of 
zeros,  and  a  is  a  scalar.  The  transformed  least  squares  problem  is  now 
that  of 


(6) 

or  equivalently 

(7) 


minimizing 


#-  E 


minimizing  |Rx  -  g||2  +  a2 


Since  R  is  nonsingular,  the  first  term  in  this  expression  will  be 
reduced  to  zero  by  the  unique  x  that  satisfies 

(8)  Rx  -  g 

and  the  minimum  value  of  the  expression  being  minimized  is  just  a2. 
Summarizing  these  steps,  we  may  write 

(9)  §2  -  min,  (Bx-y^H  ^(Bx-y) 

-  min,  |jL‘1Bx  -  L'xy||2 

-  min,  || Rx  -  g||2  +  q2 


with  the  minimizing  value  of  x  being  f  that  satisfies 

27 


Our  unbiased  estimate  of  £  is  £ .  Our  unbiased  estimate  of  4>2  is 
$2  -  S2/(m-n) .  Confidence  intervals  for  $2  are  available  by  use  of  the 
X2  distribution  as  illustrated  in  Sec.  5.3. 

A 

The  covariance  matrix  for  £  is 

(11)  Cov(0  -  <t>zV 
where 

(12)  V  -  (BtH’1B)'1  -  (RtR)'1  -  R'1R't 


6.2.  Combining  sets  of  data 

The  total  set  of  data  defining  the  problem  discussed  in  Sec.  6.1 
consists  of  the  vector  y,  and  the  matrices,  B  and  H.  Suppose  two  sets 
of  data,  (yt,  Bj ,  Hj)  and  (y2,  B2  H2) ,  have  been  acquired  that  are  both 
assumed  to  derive  from  the  same  underlying  parameter  vector,  £.  Thus, 
for  i  -  1,  2,  it  is  assumed  that  yi  is  a  noisy  observation  of  Bj£  with 
covariance  matrix,  ^2HA.  Furthermore  we  assume  the  observations  yx  and 
y2  are  independent,  in  the  sense  that  Cov(y1,y2)  -  0. 

Suppose  the  processing  described  in  Sec.  6.1  has  been  applied  to 
each  of  these  data  sets,  obtaining  estimates,  and  f2,  as  well  as  all 
of  the  intermediate  and  auxiliary  quantities  defined  in  Sec.  6.1. 

We  wish  to  consider  the  question  of  selecting  from  among  these 
intermediate  and  auxiliary  quantities  those  that  will  be  useful  in  the 
computation  of  an  estimate  of  £  based  on  the  combined  data  sets,  and 
also  to  specify  how  these  selected  quantities  are  to  be  used  to  obtain 
such  an  estimate. 

Supplementary  quantities  deriving  from  data  sets  1  or  2  will  be 
indicated  by  the  symbol  used  in  Sec.  6.1  subscripted  with  1  or  2, 
respectively.  Quantities  based  on  combined  data  will  be  indicated  by 
the  symbol  of  Sec.  6.1  with  no  subscript.  Thus  we  may  begin  by 
defining 


The  combined  problem  can  be  characterized  as  that  of  minimizing 
the  quantity  S2  defined  by 

(4)  S2  -  (Bx-y)tH‘1(Bx-y) 

-  (B1x-y1)tHj1(B1x-y1)  +  (B2x-y2)tH21(B2x-y2) 


- 1  to11  *  -  [fjr + ip-  •  fe 


where 


-  life:  •  ill! 


The  problem  of  minimizing  S2  is  now  seen  to  be  a  linear  least 
squares  problem  involving  the  matrix  ft  and  the  vector  g.  A  reasonable 
approach  to  solving  this  problem  would  be  via  the  "QR"  decomposition  as 
in  steps  6.1.  (3)  -  (8).  This  "QR"  decomposition  can  be  written  as 


[ft:g]  -  00 


>[o* !] 


where  0  is  a  (2n+2)x(2n+2)  orthogonal  matrix,  ft  is  an  nxn  nonsingular 
upper  triangular  matrix,  g  is  an  n-vector,  0  denotes  an  n-dimensional 
row  vector  of  zeros,  and  a  is  a  scalar.  We  then  have 

(7)  S2  -  ||ftx  -  g||2  +  a2 

A 

so  the  the  solution  vector  £  is  given  as  the  solution  of 


a*  -  i 


and  the  minimum  value  of  S2  is 


S2  -  Q2 


The  covariance  matrix  of  {  i 


where 


(11)  -  (fttft)’1  -  ft^ft'1 

Summarizing  the  above  process  we  see  that  the  upper  triangular 
matrix,  U,  of  Eq.6.1.(5)  can  be  chosed  as  the  quantity  to  save  for  each 
data  set  for  use  in  later  combining  the  data  sets.  The  process  of 


combining  data  sets  then  amounts  to  stacking  the  saved  U-matrices 
vertically,  as  in  Eq.(5),  and  then  computing  the  "QR"  decomposition  of 
this  augmented  matrix,  as  in  Eq.(6),  to  obtain  the  U-matrix  for  the 
combined  problem.  The  combined  covariance  matrix  can  be  computed  from 
the  R-submatrix  of  the  U-matrix,  as  in  Eq.(ll). 

6.2.1  Additional  remarks  on  combining  data  sets 

An  alternative  way  to  approach  the  minimization  of  the  last 
expression  in  Eq.  6.2.  (A)  is  to  note  that  the  minimizing  vector  ?  is 
the  unique  solution  of  the  normal  equations 

(1)  ft**?  -  ft*i 

A 

Thus  the  covariance  matrix  for  can  be  expressed  as 

(2)  1  -  (fttft)-1  -  (RtR,  +  R^R;,)'1  -  (V'1  +  V'1)'1 

A 

and  £  can  be  expressed  as 

(3)  |  -  (fttftr^g 

-  *(R|gl  +  R^g2) 

-  +  R^R2|2) 

-  (V-,1  +  v2i)'1(v-11?1  +  V^|2) 

It  is  at  least  of  mathematically  aesthetic  interest  to  note  that 
Eqs.(2)  and  (3)  can  be  written  as 

(4)  T1  -  vi1  +  V^1 
and 

(5)  r'S  -  vi1?,  +  v^l, 

Eqs.(2)  and  (3)  show  that  as  an  alternative  to  the  approach 
suggested  in  Sec.  6.2  of  using  the  U-matrix  as  the  object  to  be  saved 
and  updated  as  data  sets  are  combined,  one  could  instead  use  the  £- 
vector  and  the  V-matrix.  These  could  then  be  updated  as  data  sets  are 
combined  using  Eqs.(2)  and  (3). 

The  use  of  U  gives  much  better  numerical  stability  and  reliability 
than  the  use  of  V  and 

We  assumed  at  the  beginning  of  Sec.  6.1  that  B  was  of  rank  n.  In 
computational  practice  it  is  essential  to  recognize  that,  even  though 
the  matrix  B  may  be  of  rank  n,  if  its  column  vectors  are  nearly 
linearly  dependent,  i.e.,  if  it  has  a  large  condition  number  (see  Sec. 
4.1),  its  performance  in  a  computational  process  may  be  more  like  a 
matrix  of  lower  rank  For  this  reason  it  is  important  to  consider  what 
will  happen  to  a  computational  procedure  if  B  is  ill  conditioned  or  has 
rank  less  than  n. 


The  computation  of  the  QR  decomposition  of  a  matrix  is  well 
defined  and  numerically  stable  regardless  of  the  rank  or  condition 
number  of  the  matrix  being  factored.  Thus  using  the  U-matrix  as  the 
fundamental  object  in  processing  separate  data  sets  and  combining  them 
is  a  numerically  stable  process.  If  the  R-submatrix  of  the  U-matrix  at 
some  stage  is  ill-conditioned  or  rank-deficient  then  the  quantities  £ 
and  V  computed  from  the  U-matrix  at  that  stage  are  likely  to  be  poorly 
determined  or  undeterminable.  This  only  affects  the  £  and  V  at  this 
stage,  however.  If  this  U-matrix  is  later  combined  with  a  U-matrix 
derived  from  another  set  of  observations,  it  is  possible  that  the  R- 
submatrix  of  the  new  U-matrix  will  be  better  conditioned  and  reasonably 
accurate  values  of  £  and  V  can  be  determined. 

The  U-matrix  computed  from  some  particular  set  of  data  will  be 
essentially  the  same  regardless  of  whether  it  is  computed  from  all  the 
data  at  once  or  through  the  combining  of  U-matrices  previously  computed 
for  disjoint  subsets  of  the  data. 

In  contrast  to  the  stability  of  the  U-matrix  approach,  the 
approach  based  on  Eqs.(2)  and  (3)  is  highly  dependent  on  the  order  in 
which  data  is  grouped.  Error  in  the  V-matrix  at  any  stage  will 
propagate  to  all  following  stages.  The  method  fails  completely  if  the 
V-matrix  cannot  be  determined  at  some  stage  due  to  rank-deficiency  of 
the  underlying  B-matrix,  even  though  after  more  data  is  accumulated  one 
might  have  a  full-rank,  and  even  well-conditioned,  underlying  B-matrix. 

Updating  methods  based  on  Eqs.(2)  and  (3)  were  given  in  the  early 
papers  on  "filtering"  or  "Kalman  filtering"  in  the  late  50' s  and  early 
60' s.  Instability  of  the  type  described  here  was  recognized  as  a 
problem  in  those  days.  The  QR  decomposition  was  not  widely  known  and 
understood  in  the  early  60' s,  but  by  the  late  60' s  it  was  being  used  in 
many  algorithms  of  linear  algebra.  Its  value  in  "filtering"  was 
increasingly  appreciated  in  the  early  70’ s.  A  systematic  treatment  of 
"filtering”  emphasizing  the  use  of  the  U-matrix  is  given  in  [Bierman] . 
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Generalized  inner  product  10 

Ill-conditioned  11 

Independently  distributed  13 

Inner  product  9 

Joint  frequency  function  7 

Least  squares  15 

Left  identity  15 

Left  inverse  14 

Linear  transformation  9 

Matrix  9 

Mean  5,  12,  20 

Minimum  variance  7,  14,  15 

Moment  19 

N(* , *)  20 

N(m ;  rj , H)  20 

Nonnegative  definite  10 

Nonsingular  9 

Norm  9 

Normal  distribution  20 
Normal  equations  16 
Orthogonal  10 
Orthonormal  10 
Outer  product  9 
Positive  definite  10 
Positive  semidef inite  10 
Probability  20,  21 
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Affine  transformation  10 

Probability  density  function  4,  12 

QR  factorization  11 

Random  variable  4 

Rank  9 ,  13 

Full  9 

Rank-deficient  9 

Singular  value  decomposition  11 

Singular  values  11 

Spectral  norm  10 

Standard  deviation  5,  20 

Student's  t  24 

Symmetric  10,  12 

Unbiased  7,  14,  15 

Var(x)  6 

Variance  5 

Vector  9 

Well-conditioned  11 
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